SPELCHEK Version 1.2 - A *FAST* spelling checker by Edwin Floyd. 3-28-91
Version 1.2 implements a new, faster dictionary algorithm which
is incompatible with previous versions. Please rebuild all user
dictionaries with MAKEDICT. SPELCHEK is distributed in three files:
SP12EXE.ZIP - Executable programs
SP12DCT.ZIP - Dictionaries (large file)
SP12SRC.ZIP - TP6.0 source code to all programs
Purpose of SPELCHEK
-------------------
SPELCHEK extracts words from an input file, or several input files,
and checks them for membership in a superimposed code dictionary.
Any words not found in the dictionary, it writes to an output file,
one per line. The program recognizes a number of options for:
o High-order bit stripping
o Appending additional information to the output word list
o Defining the characters comprising a "word"
How to run SPELCHEK
-------------------
From the DOS command line enter:
SPELCHEK filenames [-H] [-M] [-W[+/-]abc..] [@name] [-Uname]
[-Oname] [-Ppath]
Spaces delimit command line parameters. You may intermingle
input text filenames and options (mark each option with a leading
hyphen). Filenames may include wild-cards. Some options (-W,-O,-U,-P)
allow a character string, filename or path to follow the option letter.
This must follow with no intervening spaces or the program will
mistake it for an input file name. Some options (-H,-M) allow a "+"
or "-" to indicate "on" or "off". This also must follow with no
intervening space, and "+" is assumed if it is omitted. You may
place options and filenames in an ASCII "include" file and
specify its name with a leading "@" on the command line. An
include file may contain references to other include files. You
also may specify default options, filenames and include files in
the DOS environment using "SET SPELCHEK=...". For example:
SET SPELCHEK=-H+ -Owords.out -W-ABCDEFGHIJKLMNOPQRSTUVWXYZ
SET SPELCHEK=@defaults.spc -O
SPELCHEK processes options left-to-right, first from the DOS
environment, then from the command line. Where options conflict,
the last option processed prevails. Thus, you may override "SET"
environment options on the command line.
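The left-to-right, last-one-wins rule can be sketched in Python as
follows. This is an illustration only, not the program's Turbo
Pascal source: the function name parse_options and the layout of
the settings dictionary are hypothetical, and -W and @include
handling are left out for brevity.

```python
def parse_options(env_value, argv):
    """Collect SPELCHEK-style settings; later tokens override earlier ones."""
    # Hypothetical defaults: -H and -M off, output to StdOut (empty name).
    settings = {"H": False, "M": False, "O": "", "U": None, "P": None}
    files = []
    # Environment tokens come first, so command-line tokens can override them.
    for token in env_value.split() + list(argv):
        if not token.startswith("-"):
            files.append(token)              # no leading hyphen: an input file
            continue
        letter, rest = token[1:2].upper(), token[2:]
        if letter in ("H", "M"):
            settings[letter] = rest != "-"   # "+" is assumed when omitted
        elif letter in ("O", "U", "P"):
            settings[letter] = rest          # value follows with no space
    return settings, files


settings, files = parse_options("-H+ -Owords.out",
                                ["mystory.doc", "-H-", "-Omystory.bwd"])
print(settings["H"], settings["O"], files)
```

Here the command line turns -H back off and replaces the output
name that was set in the environment.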
What the options mean
---------------------
-H[+/-] Clear the high-order bit on each input character
(default off). Use this option to process files
created by word processing programs, like WordStar,
that mark some letters by setting the high-order
bit, often at the beginning or end of a word.
-M[+/-] Append markup information to output word list. This
causes the program to insert a number in front of each
word written to the output file. The number indicates
the byte position in the input where the offending word
begins. The first byte in the input file is position 1.
Also, the program writes the file name at the beginning
of the word list for each input file. The file name is
preceded by a zero and a space. This output file is
intended as input to a program such as MARKDOC which
marks misspelled words in the input document.
-P[path] Indicate the drive and directory containing the
master dictionary files. There are seven master
dictionary files: AB.DCT, CD.DCT, EH.DCT, IN.DCT,
OR.DCT, ST.DCT and UZ.DCT. They all must reside
in the same directory. If no -P path is specified,
the master dictionary files must reside in the current
directory or the program directory. The master dictionary
files were created with MAKEDICT (see below) from a list
of over a hundred thousand words obtained from Public
Brand Software, 1-800-IBM-DISK.
-U[name] Name a user dictionary file. This option specifies the
name of an existential dictionary file produced by the
MAKEDICT program. You may specify the drive and full path.
If a simple file name is specified, the file is assumed to
be in the current directory. If SPELCHEK can't open the
user dictionary, it issues a warning message and processes
the input files against the master dictionaries only.
-W-abc.. Replace the "word character set" with the indicated
characters. The program checks each character in
each input file for membership in the word character
set and defines a "word" as an uninterrupted
sequence of at least one but no more than 255
characters which are members of that set. The
default is the set of upper and lower case
alphabetic characters.
-W+abc.. Add additional characters to the word character set.
-O[name] Name the output file. If the name is omitted ("-O "),
output goes to "StdOut" and is available for a DOS
pipe (|) or redirection (>). StdOut is the
default.
-O- Suppress output. -Onul also suppresses output. The
program will still display word counts on the
screen.
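How -H, -W and the -M byte positions fit together can be sketched
in Python as follows. This is a minimal illustration, not the
program's actual code: the function name extract_words is
hypothetical, and the 255-character limit on word length is not
modeled because the text does not say how longer runs are handled.

```python
import string


def extract_words(data, word_chars=string.ascii_letters, strip_high_bit=False):
    """Yield (position, word) pairs; positions are 1-based, as -M describes."""
    allowed = set(word_chars.encode("latin-1"))
    start, current = 0, bytearray()
    for index, byte in enumerate(data, start=1):
        if strip_high_bit:
            byte &= 0x7F                 # -H: clear the high-order bit
        if byte in allowed:
            if not current:
                start = index            # remember where this word begins
            current.append(byte)
        elif current:
            yield start, current.decode("latin-1")
            current = bytearray()
    if current:                          # flush a word that ends the input
        yield start, current.decode("latin-1")


sample = b"The qic\xeb fox"              # last letter of "qick" has its high bit set
plain = list(extract_words(sample))
stripped = list(extract_words(sample, strip_high_bit=True))
print(plain)
print(stripped)
```

Without high-bit stripping the marked byte ends the word early, so
"qic" is extracted; with stripping the full word "qick" is seen.
Passing word_chars=string.ascii_letters + "'" corresponds to
adding the apostrophe with -W+.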
Three examples
--------------
1. Generate list of all misspelled words in the document named
MYSTORY.DOC and write the list to file MYSTORY.BWD. The following
are equivalent:
SPELCHEK mystory.doc -Omystory.bwd
SPELCHEK mystory.doc >mystory.bwd (default StdOut)
SET SPELCHEK=-Omystory.bwd (set defaults)
SPELCHEK mystory.doc
If at this point we want an alphabetic, un-duplicated list of misspelled
words, we can use the WORDS program (see WORDS.DOC for other uses):
WORDS mystory.bwd -omystory.unq -a
2. Generate list of misspelled words in the documents named
HISPHYS.WS and OPREPORT.WS and use the list as input for MARKDOC to
mark misspelled words in both documents. The files are WordStar
documents and we wish to check a user dictionary called MEDTERM.DCT
in the current directory. The main dictionary files reside in
directory: D:\SPELL.
SET SPELCHEK=-Pd:\spell -H -O -M -Umedterm.dct
SPELCHEK hisphys.ws opreport.ws | MARKDOC
We could have specified all the options on the command line.
Ordinarily you should set the -P and -U options in the environment.
3. Generate an alphabetized, unduplicated list of misspelled words in
all the documents in the C:\SPDOC directory. Dictionaries and
parameters are as in example two.
SET SPELCHEK=-Pd:\spell -H -O -M -Umedterm.dct
SPELCHEK c:\spdoc\*.doc -ospelchek.bwd
WORDS spelchek.bwd -ounique.bwd -a
File UNIQUE.BWD now contains the alphabetized list of unique, misspelled
words from all *.DOC files in the directory.
Networks
--------
FYI, network users, SPELCHEK opens its input files in "Read, Deny
None" mode, @include files "Read, Compatibility", and the output
file in "Write, Compatibility". Only one input file at a time is
open, except during processing of nested @include files.
MAKEDICT
--------
MAKEDICT creates an optimal existential dictionary (Bloom filter) which
can be used by SPELCHEK with the "-U" option (see above). From the DOS
command line, enter:
MAKEDICT infile [bits] [extra]
The input file should be a list of words, one per line. All
characters should be upper case if the dictionary is intended
for use with SPELCHEK. The second parameter, "bits", specifies
the number of bits to superimpose for each input word. The
number of bits partly determines the accuracy of the dictionary.
For use with SPELCHEK, specify the default, 14 bits. The third
parameter, "extra", specifies an allowance of extra space so words
may be added to the dictionary and it will still remain within the
accuracy specified by the "bits" parameter. The default is zero.
The output file is given the same name as the input file, except
the extension is ".DCT". If the input file extension is ".DCT",
the output file is given the extension ".DIC".
To create a user dictionary for SPELCHEK, only the input file need
be specified. The defaults for "bits" and "extra" are exactly what
is required for a user dictionary. Example:
MAKEDICT medterm.lst
This creates a user dictionary called: MEDTERM.DCT suitable for use
by SPELCHEK.
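For readers unfamiliar with Bloom filters, the idea of
superimposing "bits" bits per word can be sketched in Python as
follows. This is not MAKEDICT's algorithm or file format: the
class and function names are hypothetical, SHA-256 stands in for
whatever hash the program really uses, and the table sizing is the
textbook formula for a Bloom filter, assumed here rather than
taken from the source.

```python
import hashlib
import math


def optimal_size(word_count, bits=14):
    """Textbook table size in bits: word_count * bits / ln 2."""
    return max(8, math.ceil(word_count * bits / math.log(2)))


class BloomFilter:
    def __init__(self, size_bits, bits_per_word=14):
        self.size = size_bits
        self.k = bits_per_word
        self.table = bytearray((size_bits + 7) // 8)

    def _positions(self, word):
        # Derive k bit positions from the upper-cased word (stand-in hash).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word.upper()}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.table[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # A word is "found" only if every one of its k bits is set.
        return all(self.table[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


terms = ["ANGINA", "STENOSIS", "TACHYCARDIA"]
user_dict = BloomFilter(optimal_size(len(terms)))
for term in terms:
    user_dict.add(term)
print(all(term in user_dict for term in terms))
```

A word that was added is always found. A word that was never added
is usually rejected, but may be accepted by chance if all of its
bits happen to have been set by other words.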
MAKEDICT prints dictionary statistics, including the odds against
incorrectly recognizing a word which is not in the dictionary. Please
remember, a Bloom filter is a probabilistic technique; collisions are
possible, but you control the collision probability by the bits setting.
All main dictionaries were created with 14 bits, corresponding to about
a 1/16384 chance of collision. When you specify a user dictionary, the
odds increase to 1/16384 plus the user dictionary odds. Thus, a 14-bit
user dictionary would increase the odds of a collision to about 1/8192.
This means, on the average, SPELCHEK will miss about one out of every
8192 different misspelled words. For instance, if a really bad speller
misspells (differently) about every tenth word in an 80,000-word
document, SPELCHEK may miss one of the misspellings.
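The arithmetic above can be checked with a few lines of Python.
The variable names are illustrative, and the 14-bit user
dictionary is the one assumed in the text. The exact combined
chance is computed rather than the simple sum; the two differ only
slightly.

```python
bits = 14
master_rate = 2 ** -bits                 # 1/16384 for the master dictionaries
user_rate = 2 ** -bits                   # a 14-bit user dictionary

# Chance that a misspelled word is accepted by either dictionary.
combined_rate = 1 - (1 - master_rate) * (1 - user_rate)

misspellings = 80_000 // 10              # every tenth word of 80,000
expected_misses = misspellings * combined_rate

print(round(1 / master_rate), round(1 / combined_rate))
print(round(expected_misses, 2))
```

The combined chance comes out at about 1 in 8192, and the expected
number of missed misspellings in the example is just under one.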
MARKDOC
-------
MARKDOC reads the output file produced by SPELCHEK with the -M+
option and marks misspelled words in the input files. From the
DOS command line, enter:
MARKDOC [markchars] [<infile]
MARKDOC reads its standard input file (STDIN). Each input line
begins with a number. The number zero is always followed by a
document file name. Each non-zero number indicates the position
of the first character of a misspelled word in the current
document file. MARKDOC reads each document file and writes an
output file which is the same as the input file, except each
misspelled word is preceded by "mark" characters. The
default mark character is a single "#", but you may specify
mark characters as a parameter on the command line. Examples:
SPELCHEK document.fil -M+ | MARKDOC %@
SPELCHEK -M+ document.fil -Omark.$$$
MARKDOC <mark.$$$
MARKDOC saves a copy of the document file under the same name
as the original document except with the extension ".BAK".
Note: MARKDOC expects to read a file produced by SPELCHEK with the
-M+ option. If this option is not set, MARKDOC will abort with a
Pascal error 106. MARKDOC is intended as a demonstration of one
use of the -M+ output file. Its crash resistance should be
improved before it's let out into the real world.
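The marking step can be sketched in Python as follows. This is an
illustration of the technique, not MARKDOC's source: the function
names are hypothetical, documents are held in memory rather than
read from disk, and a single space is assumed to separate each
number from the text that follows it.

```python
def parse_markup(lines):
    """Group -M+ style lines into {filename: [positions]}."""
    result, current = {}, None
    for line in lines:
        number, _, rest = line.strip().partition(" ")
        if not number:
            continue                     # skip blank lines
        if int(number) == 0:
            current = rest               # "0 filename" starts a new document
            result[current] = []
        elif current is not None:
            result[current].append(int(number))
    return result


def mark_document(data, positions, mark=b"#"):
    """Return data with mark inserted before each 1-based byte position."""
    out = bytearray(data)
    # Work from the end so earlier positions are not shifted by insertions.
    for pos in sorted(set(positions), reverse=True):
        out[pos - 1:pos - 1] = mark
    return bytes(out)


markup = ["0 STORY.DOC", "5 qick", "10 fxo"]
text = b"The qick fxo"
positions = parse_markup(markup)["STORY.DOC"]
print(mark_document(text, positions))
print(mark_document(text, positions, mark=b"%@"))
```

Inserting from the highest position downward keeps the remaining
positions valid, since each insertion only moves the bytes after
it.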
WORDS
-----
WORDS is a word extractor program useful for creating word lists for
MAKEDICT, among other things. See WORDS.DOC for documentation.
Legal Stuff
-----------
SPELCHEK.EXE, MAKEDICT.EXE, MARKDOC.EXE, WORDS.EXE, SPELCHEK.DOC,
and WORDS.DOC and all source code files, dictionaries, and word
lists are:
Copyright (c) 1990,91 by Edwin T. Floyd,
All rights reserved.
SPELCHEK is copyrighted "free" software. The author hereby
expressly permits and encourages individuals to use SPELCHEK at
home and at work and to distribute it without charge. The author
prohibits distribution of SPELCHEK for profit, or as a part of a
product sold for profit, except where explicit written permission
has been obtained from the author for such distribution. Also,
users groups and shareware libraries charging a disk duplication
fee not exceeding $10.00 may distribute SPELCHEK.
The author makes no warranties of any kind, either expressed or
implied, as to merchantability or fitness for any particular
purpose. SPELCHEK, et al., are available as is and in no event
will the author be held liable for damages, including any lost
profits or incidental or consequential damages, even if the author
has been advised of the possibility of such damages.
Authorship
----------
SPELCHEK was written in Turbo Pascal v6.0 by:
Edwin T. Floyd [76067,747] (CompuServe)
#9 Adams Park Court 404/576-3305 (work)
Columbus, GA 31909 404/322-0076 (home)
The latest version of SPELCHEK is available on CompuServe in
the IBMAPP forum, and on a number of bulletin boards around the
country.
- Edwin - 3-28-91
Revision History
----------------
05-13-90 V1.0 ETF Original release & DDJ submission.
01-10-91 V1.1 ETF Test version, Bloom filter CRC algorithm (not released)
03-28-91 V1.2 ETF Update for TP6.0 and second public release